Intro to Statistics
Variables, Samples, Population, Data
Bogdan G. Popescu
John Cabot University
Variables
How are two or more variables related?
- An independent variable is thought to influence or cause variation in another variable
- A dependent variable depends upon or is caused by variation in the independent variable
Examples:
Independent Variable → Dependent Variable
Associations
Two variables are associated if knowing the value of one of them will help to predict the value of the other.
Example:
Life Expectancy and Urbanization
Life Expectancy
- Indicator of countries’ overall health, physical well-being
- Of interest to many, including health researchers, economists, sociologists, anthropologists…
We will examine UN data on average life expectancy for 214 countries.
We want to know whether urbanization has a positive or negative relationship with life expectancy.
Life Expectancy and Urbanization
The Data
Life Expectancy vs Urbanization
- Countries are subjects, cases, units, or elements in the data set
- Two columns for life expectancy and urbanization
- These are variables or characteristics varying among units
Life Expectancy and Urbanization
- What explains variation in life expectancy?
- What characteristics do countries with longer life expectancy have in common?
- What characteristics do countries with shorter life expectancy have in common?
- Normally, we would be examining the relationship between different types of variables and life expectancy:
- income
- urbanization
- education
Correlates of Life Expectancy
A scatterplot of life expectancy by urbanization
Scatterplot 1
Associations vs. Causal Relationships
- If two variables are associated, knowing the value of one helps predict the value of the other
- In this example, we would predict:
- A middle-income country would have a longer life expectancy than a low-income country
- A country with more urbanization would have longer life expectancy than one with less
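One common way to quantify an association between two numerical variables is the Pearson correlation coefficient. The sketch below is only an illustration: the five data points are made up (they are not the UN data), and the helper name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unscaled covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for five made-up countries (NOT the UN data).
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]
life_expectancy = [55.0, 60.0, 66.0, 70.0, 78.0]

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")
```

A value of r close to +1 means that higher values of one variable tend to go with higher values of the other, so knowing one helps predict the other. On its own it says nothing about causation.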
Causal Relationships
A causal relationship entails three elements:
- The independent (X) and dependent variables (Y) covary
- The change in X precedes the change in Y
- The covariation between X and Y is not coincidental or spurious
Causal relationships can be stipulated in hypotheses.
Hypotheses
Relationships between variables can be stated in hypotheses.
A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.
Characteristics of Good Hypotheses
- Empirical statements that formulate educated guesses
- Logical reason to think data can confirm hypotheses
- Indicate direction of the relationship
- Terms must be defined in a way that matches how they will be measured and tested
- Data should be feasible to obtain
- Must specify unit of analysis (individuals, orgs, states, etc.)
Examples of Hypotheses
People tend to adopt political viewpoints similar to their parents.
Democracies are more likely to engage in trade with one another.
Authoritarian regimes are more likely to violate human rights.
Countries where property rights are protected tend to have higher levels of development.
Concepts
Definitions of concepts should be:
- clear
- accurate
- precise
- informative
Concepts should strike a balance between the specific and the abstract.
Populations vs. Samples
Population – complete enumeration of some set of interest
To learn about the population, a sample is often studied
Sampling is the process of selecting a subset from the population
Sampling is used to estimate characteristics of the full population
Aim: Ensure sample is representative
Requirement: Know your population
Dominant approach: probability sampling
Populations vs. Samples
Representative sample – if the sampling were repeated many times, the samples' features would match those of the population on average
Probability sampling reduces sample selection bias and makes a representative sample more likely
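The "on average" idea can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is simulated (10,000 made-up incomes), the sampling scheme is simple random sampling, and the sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated incomes (not real data).
population = [rng.gauss(50_000, 12_000) for _ in range(10_000)]
population_mean = mean(population)

# Simple random sampling without replacement: every unit has the same
# chance of being selected. Repeat the sampling 500 times.
sample_means = [mean(rng.sample(population, 200)) for _ in range(500)]

# A single sample mean usually misses the target a little, but the
# average over many repeated samples lands close to the population mean.
average_of_sample_means = mean(sample_means)
print(round(population_mean), round(average_of_sample_means))
```

Replacing `rng.sample` with a rule that favors some units (for example, only the highest incomes) would introduce the selection bias that probability sampling is meant to reduce.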
Data and Variables – Basics
Categorical
- Binary: e.g., 0 = unemployed, 1 = employed
- Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
- Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)
Data and Variables – Basics
Numerical
- Discrete: e.g., number of individuals in a household
- Continuous: e.g., height, weight, wages
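The variable type determines which summaries make sense. The sketch below uses four made-up survey records; the field names (`employed`, `color`, `health`, `household`, `height_cm`) are hypothetical and chosen only to mirror the examples above.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records (illustrative values only).
records = [
    {"employed": 1, "color": "Red", "health": "Good", "household": 3, "height_cm": 171.5},
    {"employed": 0, "color": "Green", "health": "Poor", "household": 1, "height_cm": 160.0},
    {"employed": 1, "color": "Blue", "health": "Fair", "household": 4, "height_cm": 182.3},
    {"employed": 1, "color": "Red", "health": "Fair", "household": 2, "height_cm": 175.2},
]

# Binary: the mean of a 0/1 variable is the share of ones.
employment_rate = mean(r["employed"] for r in records)

# Nominal: categories can be counted, but not ranked.
color_counts = Counter(r["color"] for r in records)

# Ordinal: categories can be ranked once the order is made explicit.
HEALTH_ORDER = {"Poor": 0, "Fair": 1, "Good": 2}
by_health = sorted((r["health"] for r in records), key=HEALTH_ORDER.get)

# Discrete: whole-number counts.
total_people = sum(r["household"] for r in records)

# Continuous: any value in a range, so an average is meaningful.
mean_height = mean(r["height_cm"] for r in records)

print(employment_rate, color_counts.most_common(1), by_health, total_people)
```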
Cross-Sectional Data
- Cross-sectional datasets have one observation per unit
- Data for one variable (attribute) measured in N countries is written as:
\[
\{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N}
\]
- Example values for one variable (e.g., life expectancy):
\[
\{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N}
\]
Cross-Sectional Data
- If we measure two attributes, we can represent them as a point in 2D space
- A single data point is a vector in two dimensions
Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4
Then the data point, written in the order (urbanization, life expectancy), is:
\[
X = [66.4, 59.75]
\]
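In code, such a data point can be stored as a small named tuple, and a cross-section as one point per unit. This is a minimal sketch: the class name `CountryPoint`, the country labels, and the two additional points are made up for illustration; only the first point comes from the example above.

```python
from typing import NamedTuple


class CountryPoint(NamedTuple):
    """One unit's observation on two attributes, ordered as (x, y)."""
    urbanization: float     # x-coordinate
    life_expectancy: float  # y-coordinate


# The data point from the example above.
point = CountryPoint(urbanization=66.4, life_expectancy=59.75)

# A cross-section holds one such point per unit. The labels and the
# two extra points are hypothetical.
cross_section = {
    "Country A": point,
    "Country B": CountryPoint(35.0, 61.2),
    "Country C": CountryPoint(81.5, 74.9),
}

# Coordinates for a scatter plot: one x and one y per country.
xs = [p.urbanization for p in cross_section.values()]
ys = [p.life_expectancy for p in cross_section.values()]
print(list(point))
```

Naming the fields avoids confusion about which coordinate comes first.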
Evaluation of Empirical Propositions
Social scientists use statistical analyses to test theories, which are expressed as carefully thought-out hypotheses.
Hypotheses are falsifiable claims about the world.
Hypotheses connect dependent variables to independent variables.
- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable
Example
Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).
Democratization Hypothesis:
More economic development is associated with higher levels of democracy.
To test this, we collect data on X and Y.
Units of analysis are the entities where our theory applies (e.g., countries, individuals, firms).
Datasets
When we collect the data, we input it into a spreadsheet, a tabular format.
This becomes a dataset.
A Dataset
In this example, there appears to be a positive relationship between X and Y.
- Not all high-X observations have high Y
- Not all low-X observations have low Y
To evaluate the relationship, we fit a line that best approximates the pattern in the data.
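One standard way to fit such a line is ordinary least squares, which picks the intercept and slope that minimize the sum of squared vertical distances between the points and the line. The slides do not name a method, so treat the choice of least squares as an assumption; the helper `fit_line` and the five data points are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-movement of x and y divided by the spread of x.
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    # The least-squares line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: the overall pattern is positive, but not every
# higher-X observation has a higher Y (compare x = 3 and x = 4).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.5, 3.5, 3.0, 5.0]

intercept, slope = fit_line(x, y)
print(f"Y = {intercept:.2f} + {slope:.2f} * X")
```

A positive slope summarizes the overall positive relationship even though individual observations deviate from the line.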
Cross-Sectional Data
Each country’s data is a point in a scatter plot.
If we measure three variables (e.g., life expectancy, urbanization, education),
each country becomes a point in 3D space, and the dataset forms a 3D point cloud.
Time-Series Data
- A time series of length T is written as:
\[
\{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T}
\]
- A time series is a sequence of data points indexed in time order
- It has a natural temporal ordering
- Time acts as the second attribute, alongside the measured value
Time-Series Data vs. Cross-Section
Time-Series Data
A time series can be depicted as a 2D scatter plot.
Time is one variable, and the value of interest is another.
So, each point in the time series is a pair: (time, value).
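The (time, value) representation can be sketched directly. The yearly values below are made up for illustration and do not come from any real series.

```python
# Hypothetical yearly values for one unit (illustrative, not real data).
years = [2000, 2001, 2002, 2003, 2004]
values = [70.1, 70.4, 70.9, 71.3, 71.8]

# Each observation is a (time, value) pair.
series = list(zip(years, values))

# The temporal ordering matters: the pairs are kept sorted by time.
is_time_ordered = series == sorted(series)

# Because of that ordering, period-to-period changes are meaningful.
changes = [
    (t1, round(v1 - v0, 2))
    for (t0, v0), (t1, v1) in zip(series, series[1:])
]
print(series[0], changes)
```

In a cross-section, by contrast, the order of the units is arbitrary, so differences between neighboring rows carry no meaning.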
Time-Series and Cross-Section Data
Combining the two gives a cross-section of time series, also called panel data: several units, each observed over time.
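Time-Series and Cross-Section Data
Balanced Panel: every unit is observed in every time period.
Time-Series and Cross-Section Data
Unbalanced Panel: at least one unit is missing an observation for at least one time period.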
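Whether a panel is balanced can be checked mechanically, using the usual definition that a balanced panel has an observation for every unit in every time period. The sketch below is illustrative: the function name `is_balanced`, the unit labels, and all values are hypothetical.

```python
def is_balanced(panel):
    """Return True if every unit is observed in every period.

    `panel` maps (unit, period) keys to observed values.
    """
    units = {unit for unit, _ in panel}
    periods = {period for _, period in panel}
    # A balanced panel contains every unit-period combination.
    expected = {(u, t) for u in units for t in periods}
    return set(panel) == expected


# Hypothetical panel: two units observed in each of three years.
balanced = {
    ("A", 2000): 70.1, ("A", 2001): 70.4, ("A", 2002): 70.9,
    ("B", 2000): 64.0, ("B", 2001): 64.6, ("B", 2002): 65.1,
}

# Same panel with unit B missing its 2001 observation.
unbalanced = {key: value for key, value in balanced.items() if key != ("B", 2001)}

print(is_balanced(balanced), is_balanced(unbalanced))
```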
Conclusion
- Measurement quality depends on accuracy and precision
- Reliability: can we replicate results?
- Validity: does the measure reflect the concept?
- Variables can be categorical or numerical
- Data can be cross-sectional, time-series, or both (panel data)
- Panel data can be balanced or unbalanced